Arborest – a VISL-Style Treebank Derived from an Estonian Constraint Grammar Corpus

Authors

  • Eckhard Bick
  • Heli Uibo
  • Kaili Müürisep
Abstract

Treebank creation is a very labor-intensive task, especially if the intended applications include machine learning, gold standard parser evaluation or teaching, since only a manually checked, syntactically annotated corpus can provide optimal support for these purposes. There are, however, ways to make the annotation process (partly) automatic, saving (manual) annotation time and/or allowing the creation of larger corpora. Whenever possible, existing resources – both corpora and grammars – should be reused. In the case of the Estonian treebank project Arborest, we have therefore opted to make use of existing technology and experience from the VISL project, where two-stage systems including both Constraint Grammar (CG) and Phrase Structure Grammar (PSG) parsers have been used to build treebanks for several languages (Bick, 2003 [1]). Moreover, the VISL annotation scheme has been adopted as a standard for tagging the parallel corpus in the Nordic Treebank Network. For Estonian, there already exists a shallow syntactically annotated – and proof-read – corpus, allowing us to bypass the first step in treebank construction (CG-parsing). This paper describes how a VISL-style hybrid treebank of Estonian has been semi-automatically derived from this corpus with a special Phrase Structure Grammar, using as terminals not words, but CG function tags. We will analyze the results of the experiment and look more thoroughly at adverbials, non-finite verb constructions and complex noun phrases. The questions we will try to answer are:


Similar resources

Syntactically annotated corpora of Estonian

Syntactically annotated corpora are needed 1) to train and test parsers and various language technology products (grammar checkers, information retrievers and extractors, machine translators, etc.); 2) to check the agreement of existing linguistic theories with real language usage. The corpora can be annotated on different levels of depth. In shallow syntactically annotated corpora a syntact...


Arborest – a Growing Treebank of Estonian

Treebank creation is a very labor-consuming task, especially if the applications intended include machine learning, gold standard parser evaluation or teaching, since only a manually checked syntactically annotated corpus can provide optimal support for these purposes. There are, however, possibilities to make the annotation process (partly) automatic, saving (manual) annotation time and/or all...


Floresta Sintá(c)tica: A treebank for Portuguese

This paper reviews the first year of the creation of a publicly available treebank for Portuguese, Floresta Sintá(c)tica, a collaboration project between the VISL and the Computational Processing of Portuguese projects. After briefly describing the main goals and the organization of the project, the creation of the annotated objects is presented in detail: preparing the text to be annotated, ap...


Estonian Dependency Treebank: from Constraint Grammar tagset to Universal Dependencies

This paper presents the first version of the Estonian Universal Dependencies Treebank, which has been semi-automatically acquired from the Estonian Dependency Treebank and comprises ca 400,000 words (ca 30,000 sentences) representing the genres of fiction, newspapers and scientific writing. The article analyses the differences between the two annotation schemes and the conversion procedure to Universal Dependen...


The VISL System: Research and applicative aspects of IT-based learning

The paper presents an integrated interactive user interface for teaching grammatical analysis through the Internet medium (Visual Interactive Syntax Learning), developed at Southern Denmark University, covering 14 different languages, half of which are supported by live grammatical analysis of running text. For reasons of robustness, efficiency and correctness, the system's internal tools are...



Publication date: 2004